The cubic regression spline is a flexible extension of the linear model in regression analysis. Its design matrix is typically built from B-splines: a series of basis functions, each supported on an interval defined by chosen knots. Cubic regression splines are widely used for modeling nonlinear relationships and interactions between variables. Unlike traditional methods such as polynomial regression or broken-stick regression, a cubic regression spline accounts for both smoothness and local influence (Faraway, 2015). According to Faraway, each basis function must be a continuous cubic polynomial that is nonzero on an interval defined by the knots and zero everywhere else, which ensures local influence; smoothness further requires the first and second derivatives to be continuous at each knot (Faraway, 2015). More details on cubic regression splines can be found at link
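To make these two conditions concrete, here is a minimal illustrative sketch (not code from the tutorial): a truncated power basis for a cubic spline, in which each (x - k)_+^3 column is zero to the left of its knot (local influence) yet has continuous first and second derivatives at the knot (smoothness). The function name and knot locations are invented for illustration.

```python
import numpy as np

# Illustrative helper (not from the tutorial): truncated power basis
# for a cubic spline with the given interior knots.
def cubic_spline_basis(x, knots):
    # Global cubic polynomial part: 1, x, x^2, x^3
    cols = [np.ones_like(x), x, x**2, x**3]
    # One truncated cubic (x - k)_+^3 per knot: zero for x < k,
    # continuous first and second derivatives at x = k
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 10.0, 101)
X = cubic_spline_basis(x, knots=[2.5, 5.0, 7.5])
print(X.shape)  # (101, 7): four polynomial columns plus one per knot
```

Because each truncated-power column vanishes to the left of its knot, moving a data point only perturbs the fit locally, which is exactly the property polynomial regression lacks.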
In this project, we’ll use the data set uswages from the R package faraway. The data set can be found at link The descriptions of the variables are as follows:
| Variables | Type | Explanation |
|---|---|---|
| wage | double | Real weekly wages in dollars |
| educ | integer | Years of education |
| exper | integer | Years of experience |
| race | integer | 1 if Black, 0 if White (other races not in sample) |
| smsa | integer | 1 if living in Standard Metropolitan Statistical Area, 0 if not |
| ne | integer | 1 if living in the Northeast, 0 if not |
| mw | integer | 1 if living in the Midwest, 0 if not |
| so | integer | 1 if living in the South, 0 if not |
| we | integer | 1 if living in the West, 0 if not |
| pt | integer | 1 if working part time, 0 if not |
In this tutorial, we will focus on the relationship between the response variable wage and the predictor variable exper (years of experience). A two-dimensional relationship is easier to present and lets us better illustrate the cubic regression spline method.
Regression spline methods are easy to apply with the R package splines. To load the data and draw plots, the packages faraway, ggplot2, and dplyr are also needed.
library(splines)
library(faraway)
library(ggplot2)
library(dplyr)
data(uswages)
ggplot(uswages, aes(exper, wage))+geom_point(col = "slategrey") + ggtitle("Relationship between weekly wages and years of experience")
From the plot, we can see a few outliers with extremely high wages. To limit their influence on the regression models, we remove observations with wage greater than 4000.
uswages = filter(uswages, wage<4000)
First, let’s try to capture the relationship with an ordinary least squares (OLS) model.
fit1 = lm(wage~exper, data = uswages)
plot(uswages$exper, uswages$wage, xlab = "Experience",
ylab = "Weekly wage", main = "OLS model", col = "slategrey")
abline(fit1, col = "red")
From the plot, we can see that the OLS model fails to capture most of the variance in the data.
Polynomial regression is a good alternative in this case. Linear models with polynomials of degree 2 and degree 4 are fit as follows:
g2 = lm(wage~poly(exper, 2), data = uswages)
g4 = lm(wage~poly(exper, 4), data = uswages)
uswages = mutate(uswages, degree2 = fitted(g2), degree4 = fitted(g4))
ggplot(uswages, aes(exper, wage)) +
geom_point( col = "slategrey") +
geom_line(aes(exper, degree2,color = "2"))+
geom_line(aes(exper, degree4,color = "4")) +
scale_color_manual(values = c(
'2' = 'darkblue',
'4' = 'red')) +
labs(color = 'Polynomial degree')+
ggtitle("Polynomial regression models")
Finally, we fit a cubic regression spline with bs() from the splines package, placing knots at 0, 20, 40, and 58:
cubic_spline = lm(wage~bs(exper, knots = c(0, 20, 40, 58)), data = uswages)
uswages = mutate(uswages, smooth = fitted(cubic_spline))
ggplot(uswages, aes(exper, wage)) +
geom_point(col = "slategrey") +
geom_line(aes(exper, smooth), col = "red") +
ggtitle("Cubic regression spline model")
The same analysis can be carried out in Python. First, we import the required libraries:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from math import sqrt
from sklearn.metrics import mean_squared_error
from patsy import dmatrix
# read dataset
data = pd.read_csv("uswages.csv")
print(data.head())
## Unnamed: 0 wage educ exper race smsa ne mw so we pt
## 0 6085 771.60 18 18 0 1 1 0 0 0 0
## 1 23701 617.28 15 20 0 1 0 0 0 1 0
## 2 16208 957.83 16 9 0 1 0 0 1 0 0
## 3 2720 617.28 12 24 0 1 1 0 0 0 0
## 4 9723 902.18 14 12 0 1 0 1 0 0 0
print(data.describe())
## Unnamed: 0 wage ... we pt
## count 2000.000000 2000.000000 ... 2000.00000 2000.000000
## mean 14015.171000 608.117865 ... 0.21000 0.092500
## std 8099.947724 459.832629 ... 0.40741 0.289803
## min 7.000000 50.390000 ... 0.00000 0.000000
## 25% 7028.750000 308.640000 ... 0.00000 0.000000
## 50% 13946.500000 522.320000 ... 0.00000 0.000000
## 75% 21117.250000 783.480000 ... 0.00000 0.000000
## max 28140.000000 7716.050000 ... 1.00000 1.000000
##
## [8 rows x 11 columns]
print(data.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2000 entries, 0 to 1999
## Data columns (total 11 columns):
## Unnamed: 0 2000 non-null int64
## wage 2000 non-null float64
## educ 2000 non-null int64
## exper 2000 non-null int64
## race 2000 non-null int64
## smsa 2000 non-null int64
## ne 2000 non-null int64
## mw 2000 non-null int64
## so 2000 non-null int64
## we 2000 non-null int64
## pt 2000 non-null int64
## dtypes: float64(1), int64(10)
## memory usage: 172.0 KB
## None
We can see that wage and experience have a non-linear relationship.
## exper vs wage ##
data_x = data[['exper']]
data_y = data[['wage']]
#visualize the relationship between experience and wage
plt.scatter(data_x, data_y, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.suptitle('Fig 1. Relationship between experience and wage', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
Looking at Fig 1., there appears to be an outlier (a wage above 7000), so we remove it before building our models.
# remove outlier
data_ylim = data.loc[data['wage']<= 7000]
wage = data_ylim[['wage']]
exper_x = data_ylim[['exper']]
#visualize the relationship between experience and wage
plt.clf()
plt.scatter(exper_x, wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.suptitle('Fig 2. Relationship between experience and wage (remove outlier)', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
#add an intercept (beta_0) to our model
exper_x = sm.add_constant(exper_x)
# model fitting
model = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions1 = model.predict(exper_x)
print(model.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: wage R-squared: 0.029
## Model: OLS Adj. R-squared: 0.029
## Method: Least Squares F-statistic: 59.88
## Date: Tue, 27 Nov 2018 Prob (F-statistic): 1.59e-14
## Time: 10:16:16 Log-Likelihood: -14935.
## No. Observations: 1999 AIC: 2.987e+04
## Df Residuals: 1997 BIC: 2.989e+04
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 503.1140 16.198 31.060 0.000 471.347 534.881
## exper 5.5164 0.713 7.738 0.000 4.118 6.915
## ==============================================================================
## Omnibus: 1156.252 Durbin-Watson: 1.957
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 16087.929
## Skew: 2.445 Prob(JB): 0.00
## Kurtosis: 16.009 Cond. No. 38.7
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Fig 3. shows that simple linear regression cannot explain the relationship between wage and experience, since the association is clearly non-linear.
# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(exper_x['exper'], predictions1, color = 'green', linewidth = 1.5)
plt.suptitle('Fig 3. Relationship between experience and wage: using simple linear regression', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
# Calculating RMSE value
rms1 = sqrt(mean_squared_error(wage, predictions1))
print(rms1)
## 425.13535436123
# refit model using polynomial regression ("exper" with degree = 2)
exper_x['exper2'] = np.power(exper_x['exper'], 2)
# model fitting
model2 = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions2 = model2.predict(exper_x)
print(model2.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: wage R-squared: 0.133
## Model: OLS Adj. R-squared: 0.132
## Method: Least Squares F-statistic: 152.9
## Date: Tue, 27 Nov 2018 Prob (F-statistic): 1.65e-62
## Time: 10:16:16 Log-Likelihood: -14822.
## No. Observations: 1999 AIC: 2.965e+04
## Df Residuals: 1996 BIC: 2.967e+04
## Df Model: 2
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 268.2870 21.573 12.436 0.000 225.979 310.595
## exper 38.8749 2.262 17.190 0.000 34.440 43.310
## exper2 -0.7334 0.047 -15.453 0.000 -0.826 -0.640
## ==============================================================================
## Omnibus: 1214.317 Durbin-Watson: 1.969
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 19762.288
## Skew: 2.557 Prob(JB): 0.00
## Kurtosis: 17.530 Cond. No. 1.97e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.97e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
The polynomial curve in Fig 4. captures the non-linear relationship between wage and experience.
# create a grid of 100 evenly spaced experience values for plotting
x_lim = np.linspace(start = exper_x['exper'].min(), stop = exper_x['exper'].max(), num = 100)
x_lim_df = pd.DataFrame({'exper':x_lim})
x_lim_df['exper2'] = np.power(x_lim_df['exper'], 2)
x_lim_df = sm.add_constant(x_lim_df)
# find fitted value using x_lim
fit_reduce = model2.predict(x_lim_df)
# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(x_lim_df[['exper']], fit_reduce, color = 'blue', linewidth = 1.5, label='experience with degree = 2')
plt.legend()
plt.suptitle('Fig 4. Relationship between experience and wage: using polynomial regression', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
# Calculating RMSE value
rms2 = sqrt(mean_squared_error(wage, predictions2))
print(rms2)
## 401.78144699617144
# cubic spline with 4 knots at 0, 20, 40, 57
cubic_x = dmatrix("bs(data, knots = (0, 20, 40, 57), include_intercept = False)", {"data": exper_x[['exper']]}, return_type = 'dataframe')
# model fitting
model3 = sm.GLM(wage, cubic_x).fit()
# find fitted value
predictions3 = model3.predict(cubic_x)
print(model3.summary())
## Generalized Linear Model Regression Results
## ==============================================================================
## Dep. Variable: wage No. Observations: 1999
## Model: GLM Df Residuals: 1991
## Model Family: Gaussian Df Model: 7
## Link Function: identity Scale: 1.6183e+05
## Method: IRLS Log-Likelihood: -14821.
## Date: Tue, 27 Nov 2018 Deviance: 3.2220e+08
## Time: 10:16:16 Pearson chi2: 3.22e+08
## No. Iterations: 3 Covariance Type: nonrobust
## ===============================================================================================================================
## coef std err z P>|z| [0.025 0.975]
## -------------------------------------------------------------------------------------------------------------------------------
## Intercept 193.5365 261.141 0.741 0.459 -318.290 705.363
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[0] 12.3372 274.287 0.045 0.964 -525.255 549.929
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[1] 302.5933 259.491 1.166 0.244 -205.999 811.186
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[2] 709.6090 272.728 2.602 0.009 175.071 1244.147
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[3] 482.8562 267.447 1.805 0.071 -41.330 1007.042
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[4] 193.8713 284.069 0.682 0.495 -362.894 750.637
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[5] 31.3605 323.113 0.097 0.923 -601.930 664.651
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[6] -28.7265 479.608 -0.060 0.952 -968.741 911.288
## ===============================================================================================================================
# grid of 100 evenly spaced experience values for plotting
x_lim = np.linspace(exper_x['exper'].min(), exper_x['exper'].max(), 100)
# find fitted value using x_lim
fit_reduce2 = model3.predict(dmatrix("bs(train, knots = (0, 20, 40, 57), include_intercept = False)", {"train": x_lim}, return_type = 'dataframe'))
The spline curve in Fig 5. also captures the non-linear relationship between wage and experience.
# plot spline
plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(x_lim, fit_reduce2, color='r', label='Specifying 4 knots, knots = (0, 20, 40, 57)')
plt.legend()
plt.suptitle('Fig 5. Relationship between experience and wage: using cubic regression spline', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
# Calculating RMSE value
rms3 = sqrt(mean_squared_error(wage, predictions3))
print(rms3)
## 401.4743219404939
# overlay three regression curve
plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(exper_x['exper'], predictions1, color = 'green', linewidth = 1.5, label = 'Simple Linear Regression')
plt.plot(x_lim_df['exper'], fit_reduce, color = 'blue', linewidth = 1.5, label='Polynomial Regression, experience degree = 2')
plt.plot(x_lim, fit_reduce2, color='r', linewidth = 1.5, label='Cubic Regression Spline, knots = (0, 20, 40, 57)')
plt.legend()
plt.suptitle('Fig 6. Relationship between experience and wage', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
# compare RMSE
model_names = ['SLR', 'Polynomial', 'Spline']
RMSE = [rms1, rms2, rms3]
compare = pd.DataFrame({'Model':model_names, 'RMSE':RMSE})
print(compare)
## Model RMSE
## 0 SLR 425.135354
## 1 Polynomial 401.781447
## 2 Spline 401.474322
Faraway, J. J. (2015). Linear Models with R (2nd ed.). Boca Raton, FL: Taylor & Francis.